feat: add document classification pipeline #6

Merged
caio-pizzol merged 2 commits into main from caio-pizzol/classification-pipeline
Mar 9, 2026
@caio-pizzol

Python ML pipeline (scripts/classification/) that classifies ~800K .docx documents by document type (10 classes) and topic (9 classes) using the FineWeb-Edu pattern: LLM labels a sample → train ModernBERT → apply at scale.

Pipeline steps:

  • sample.py: stratified sampling across languages and word count
  • label.py: async LLM labeling with Claude (resumable)
  • train.py: fine-tune two ModernBERT classifiers
  • classify.py: batch inference on full corpus
  • evaluate.py: quality metrics and distribution analysis
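As an illustration of the sampling step listed above, here is a minimal sketch of stratified sampling over (language, word-count bucket) strata. It is not the code in sample.py: the field names `language` and `word_count`, the bucket edges, and the `per_stratum` parameter are all assumptions made for the example.

```python
import random
from collections import defaultdict


def word_count_bucket(n_words, edges=(100, 1000, 10000)):
    """Map a word count to a bucket index (edges are hypothetical)."""
    for i, edge in enumerate(edges):
        if n_words < edge:
            return i
    return len(edges)


def stratified_sample(docs, per_stratum, seed=0):
    """Draw up to `per_stratum` documents from each (language, bucket) stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for doc in docs:
        key = (doc["language"], word_count_bucket(doc["word_count"]))
        strata[key].append(doc)

    sample = []
    for key in sorted(strata):  # sorted for a reproducible order
        group = strata[key]
        k = min(per_stratum, len(group))  # small strata are taken whole
        sample.extend(rng.sample(group, k))
    return sample
```

Capping each stratum rather than sampling proportionally keeps rare languages and unusual document lengths represented in the labeled set, at the cost of a sample that no longer mirrors the corpus distribution.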

Also adds:

  • LLM classification fields and methods to DbClient
  • CLAUDE.md / AGENTS.md at root, packages/shared, and scripts/classification
  • Updated README with Phase 5 (Classify) and project structure
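The "resumable" property of the labeling step in the description can be sketched as follows. This is only one plausible design, not what label.py necessarily does: it assumes labels are appended to a JSONL file keyed by a hypothetical `id` field, and `label_fn` is a placeholder for whatever async call reaches the LLM.

```python
import asyncio
import json
import os


def load_done_ids(path):
    """Return the ids already present in the output file, if it exists."""
    if not os.path.exists(path):
        return set()
    done = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                done.add(json.loads(line)["id"])
    return done


async def label_all(docs, label_fn, out_path, concurrency=4):
    """Label only the documents not yet in `out_path`; return how many were labeled."""
    done = load_done_ids(out_path)
    pending = [d for d in docs if d["id"] not in done]
    sem = asyncio.Semaphore(concurrency)  # bound the number of in-flight requests

    with open(out_path, "a", encoding="utf-8") as out:

        async def worker(doc):
            async with sem:
                label = await label_fn(doc)
            # Append and flush immediately so an interrupted run loses little work.
            out.write(json.dumps({"id": doc["id"], **label}) + "\n")
            out.flush()

        await asyncio.gather(*(worker(d) for d in pending))
    return len(pending)
```

Rerunning the same command after a crash then skips everything already written, so only the unlabeled remainder is sent to the model again.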

@codecov
codecov bot commented Mar 9, 2026

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

ℹ️ You can also turn on project coverage checks and project coverage reporting on Pull Request comment

Thanks for integrating Codecov - We've got you covered ☂️

- Merge train_modal.py into train.py (--modal flag)
- Merge classify_modal.py into classify.py (--modal --workers N)
- Switch base model to xlm-roberta-base (multilingual)
- Add class-weighted loss for imbalanced classes
- Add --exclude flag to sample.py for iterative sampling
- Gitignore models/ and *.jsonl artifacts
- Update docs for Modal setup and cost estimates
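
For readers unfamiliar with the class-weighted loss mentioned in the list above, here is a minimal, dependency-free sketch. The commit does not say how the weights are derived; the inverse-frequency ("balanced") heuristic below is an assumption, and the function names are hypothetical. The weighted mean follows the usual convention of dividing by the sum of the target weights.

```python
import math
from collections import Counter


def balanced_class_weights(labels, num_classes):
    """Inverse-frequency weights: n / (num_classes * count); 0.0 for unseen classes."""
    counts = Counter(labels)
    n = len(labels)
    return [
        n / (num_classes * counts[c]) if counts[c] else 0.0
        for c in range(num_classes)
    ]


def weighted_cross_entropy(logits, targets, weights):
    """Weighted mean cross-entropy over a batch of logit rows."""
    total = 0.0
    weight_sum = 0.0
    for row, t in zip(logits, targets):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += weights[t] * (log_z - row[t])
        weight_sum += weights[t]
    return total / weight_sum
```

With 8 examples of class 0 and 2 of class 1, the weights come out as 0.625 and 2.5, so a mistake on the rare class costs four times as much as one on the common class.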
@caio-pizzol caio-pizzol merged commit d1c7983 into main Mar 9, 2026
1 check passed
@caio-pizzol caio-pizzol deleted the caio-pizzol/classification-pipeline branch March 9, 2026 17:10